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Abstract 

Hardware-aware design and optimization is crucial in exploiting emerging architectures for PDE-based computational fluid dy¬ 
namics applications. In this work, we study optimizations aimed at acceleration of OpenFOAM-based applications on emerging 
hybrid heterogeneous platforms. OpenFOAM uses MPI to provide parallel multi-processor functionality, which scales well on 
homogeneous systems but does not fully utilize the potential per-node performance on hybrid heterogeneous platforms. In our 
study, we use two OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov iterative methods. We propose a 
number of optimizations of the dominant kernel of the Krylov solver, aimed at acceleration of the overall execution of the applica¬ 
tions on modern GPU-accelerated heterogeneous platforms. Experimental results show that the proposed hybrid implementation 
significantly outperforms the state-of-the-art implementation. 

(©2014 Published by Elsevier Ltd. 
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1. INTRODUCTION 

Current HPC systems are complex and difficult to utilize and manage at their extremes. The heterogeneity of 
these platforms leads to several challenges and much contemporary attention is devoted to new software solutions. 
However, CFD packages, until recently, have been aimed at homogeneous systems and the parallel approach is mostly 
based on the process-level Message Passing Interface (MPI). This trend in the HPC platforms invites redesign of the 
CFD packages or the algorithms themselves to use these platforms efficiently. 

This paper proposes optimizations aimed at accelerating numerical simulations on modem and emerging hybrid 
heterogeneous platforms, which are illustrated in two OpenFOAM [1] applications: icoFoam and laplacianFoam, 
which are representative of many low-order discretization of incompressible flow. We selected the OpenFOAM CFD 
package because of its popularity as an open source library with broad adoption. Both icoFoam, which is an incom¬ 
pressible flow solver, and laplacianFoam, which solves the Poisson equation, for e.g., thermal diffusion, are based on 
Krylov iterative methods such as conjugate gradient method. 
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Figure 1. The attainable performance on: A) a GeForce GTX Titan, B) a Tesla C2050 (Fermi) node, according to the roofline model combined with 
the low memory bandwidth of the device measured by stream benchmark. 


The behavior of the conjugate gradient method (CG) has been extensively studied. The performance of this method 
per iteration depends on the performance of each individual kernel. There are three primary kernels in CG; vector 
inner products, vector-vector updates, and sparse matrix-vector multiplication. The parallel-run of the conjugate gra¬ 
dient on a distributed memory architecture using MPI parallelizes the computations and introduces communications 
as well. Two types of communications are necessary to support the algorithm. The first type is point-to-point com¬ 
munication during the matrix-vector multiplication in order to include the influence of the parallel interfaces with 
neighbor processors. The second one is a collective communication to reduce the sum of scalars as part of forming 
inner products. It occurs three times while calculating global variable and updating the residual norm. Figure 1 illus¬ 
trates the roofline model [2] of both Fermi and Titan GPUs platform, which shows the attainable performance when it 
is limited by either device memory or PCIe bandwidth. The stream memory bandwidth is measured using our imple¬ 
mentation of the STREAM benchmark [3]. Utilizing such heterogeneous platform for memory bound kernel, CG for 
instance, leads to two main challenges; data movements and workload balancing across hybrid heterogeneous devices. 

The proposed optimizations are aimed at accelerating the icoFoam and laplacianFoam solvers on heterogeneous 
hybrid nodes beyond their current performance. A hybrid conjugate gradient solver is designed and implemented 
combining MPI and CUBA routines. A load-balancing step is applied using heterogeneous decomposition, which 
decomposes the computations taking into account the performance of each computing device and minimizing com¬ 
munication. The heterogeneous decomposition method consists of two steps; building the accurate performance 
model of the application using the approach of FuPerMod [4, 5, 6, 7], and then using this performance model as input 
to MeTiS [8] /SCOTCH [9] libraries. In addition, we present a new pipelined conjugate gradient solver as an algo¬ 
rithmic improvement inspired by the recent work of Ghysels and Vanroose [10] and implemented using MPI, CUBA, 
and a hybrid technique. 

This paper is organized as follows. Section 2 provides a brief introduction to OpenFOAM and the applications 
selected for this study. Section 3 outlines related work. The proposed optimizations are presented in Section 4. Section 
5 gives experimental evaluation of the performance of our optimizations. Section 6 concludes the paper. 

2. OpenFOAM 

Open source Field Operation And Manipulation (OpenFOAM) [1] is a library written in C-i-H- used to solve partial 
differential equations (PBEs). It features a wide range of solvers employed in CEB, such as Laplace and Poisson 
kernels, incompressible flow, multiphase flow, and user-defined models. OpenEOAM uses MPI to provide parallel 
multi-processor functionality, which scales well on homogenous systems but does not fully utilize potential per-node 
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performance on hybrid heterogeneous platforms. 

The OpenFOAM main library provides a solution framework, including mesh handling, finite volume discretiza¬ 
tion method, linear system solvers, data structure and input and output handling. An attraction of OpenFOAM is its 
modularity, which leads to a flexible design by using C-i-H- object oriented structures. One of the modules consists 
of the linear algebra kernels; this is instrumental in almost all the PDF solver codes. This work selects two applica¬ 
tions from the OpenFOAM standard applications set. The first is icoFoam, which solves the incompressible laminar 
Navier-Stokes equations. It applies the Pressure Implicit with Splitting of Operators (PISO) [11] algorithm in a time 
stepping loop. The governing equations are the incompressible continuity equation (1) and momentum equation (2). 

V.M = 0, (1) 

^ -H V . (mm) - V . (vVm) = -Vn. (2) 

ot 

Here m is the velocity vector, v is the kinematic viscosity, and p is the pressure. The PISO algorithm consists of three 
main stages. The first stage is the momentum predictor, which solves the momentum equation by using an initial or 
previous pressure field. This solution of the momentum equation gives the velocity field that is not divergence free 
but approximately satisfies the momentum equation. The second stage is the pressure solution. Third is the explicit 
velocity correction stage. The velocity field is corrected as a consequence of the pressure distribution. The velocity 
correction is performed in an explicit manner. The Conjugate Gradient (CG) and BiConjugate Gradient Stabilized 
(BiCGSTAB) linear solvers are used to solve the pressure and velocity fields respectively. 

The second selected application is laplacianFoam, which is used to find the solution of the Laplacian equation. The 
equation contains one variable, a passive scalar, for instance, a temperature, T. The matrix is formed and computed as 
its time-implicit Helmhpltz form, as in [1], for: 

5 - V(Dr ■ r) = 0. (3) 

ot 

The linear solver that is used to solve the temperature in laplacianFoam is CG [12], which is one of the best-known 
Krylov subspace methods. CG is used to solve a system of linear equations in the form: 

Ax = b. (4) 

Here x is an unknown vector of size the number of cells in the computational domain, b is of corresponding length, and 
A is a given square, symmetric, positive-definite or semidefinite matrix. The main building blocks of a CG iteration, 
as formerly mentioned, are: a sparse matrix vector multiply kernel, an optional preconditioning, three global dot 
products, and vector scalings and vector-vector additions. 

2.7. Parallel Approach 

A common approach for parallelizing partial differential equations is domain decomposition. The mesh and its 
associated fields are divided into subdomains and allocated to separate processors. OpenFOAM applies domain de¬ 
composition to enable process-level parallelism [1]. Thus, the program can run in parallel on separate subdomains, 
with communications between processors using the MPI communication protocol. The number of subdomains should 
be equal to (or greater than) the number of processors. Parallel execution engages communication and synchronization 
between the processors. There are two types of communications: communications between processors hosting the 
neighboring subdomains, and global communication involving all the processors. As with most mesh-based applica¬ 
tions, the communication scheme between processors is based on the halo-layer approach with overlapping elements, 
which duplicates the cells next to a processor boundary (internal boundary). The halo-layer covers all internal bound¬ 
aries and is explicitly updated through communications between processors. 

The most time consuming parts are the sparse linear algebraic kernels. Parallel CG engages point-to-point com¬ 
munication between neighboring processors during the sparse matrix vector kernel and global synchronization for 
every dot product. 
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3. RELATED WORK 

In the past decade, a number of parallel packages designed for CFD have been proposed including OpenFOAM 
[1]. Many of these packages are aimed at homogeneous systems, use MPI-based bulk synchronous processing par¬ 
allelization, and pay no special attention to the underlying hardware. The addition of GPUs to high-end scientific 
computer systems anticipates new achievable performance levels. As a result, several works attempt to employ GPG- 
PUs in CFD codes and much more attention is focused on sparse linear algebra kernels, such as implementing the 
conjugate gradient on GPU [13]. These works include the Cufflink library [14], which extends the OpenFOAM linear 
solvers capabilities to perform on GPUs. 

Pioneering CFD works have addressed heterogeneous and hybrid systems. A parallel simulation of oil extrac¬ 
tion has been designed and optimized for heterogeneous networks of computers [15] by applying the performance 
modeling of the target platform and model-based domain partitioning. Gropp [16] studied the hybrid MPI-i-OpenMP 
programming model on unstructured implicit CFD solvers and its affects on obtaining per-processor efficiency and 
parallel efficiency. Accordingly, there have been some attempts to obtain a hybrid parallelization in CFD on multi-core 
platforms [17] such as the CFD solver TAU [18] that provides hybrid MPI-i-OpenMP parallelization. There are matrix 
and graph partitioning libraries, which consider heterogeneity of the devices, such as MeTiS [8] and SCOTCH [9]. 
These two libraries assume a given vector of positive constants, representing the relative volume of computations to 
be performed by each processor, to partition a given domain into subdomains, assuming one-to-one mapping between 
the subdomains and the processors. 

Recently, there have been attempts to investigate the parallel model of hybrid MPIh-GPGPU. Papadrakakis [19], 
attempts to balance the computation across heterogeneous hybrid CPU-i-GPU system. The balancing is performed at 
runtime by using task-based parallelism and migrations between the compute devices. On the other hand, a general 
framework is supported by StarPU [20] project, which provides and supports task-based programming and scheduling 
the provided tasks on heterogeneous platform, which combines CPUs and GPUs. The OP2 project [21] provides a 
framework to implement CFD applications using unstructured meshes on different computational hardware including 
multi-cores and GPUs (many-cores) systems. The parallelization scheme is similar to the task-based parallelism in the 
way it handles the data dependencies. However, it is not applicable to multi-block codes as is the case in OpenFOAM 
and there is no special attention to the heterogeneity in the domain decomposition method, which is performed at the 
MPI level. Moreover, it does not yet extend to multiple GPUs, which is an important direction as the communication 
architecture of GPUs evolves, ultimately bypassing the PCI-express (PCIe) protocol. 

4. PROPOSED OPTIMIZATIONS 

The main motivation of this work is to increase the performance of the selected applications and maintain an 
effective use of the allocated resources. The first solution algorithm proposed here is the hybrid conjugate gradient, 
which solves the system of linear equations derived from the partial differential equations using the original conjugate 
gradient method on heterogeneous hybrid platform efficiently. Second, algorithmic improvements of the CG solver 
that are inspired by the recent work of Ghysels and Vanroose [10], which introduces more computational steps un¬ 
necessary in the original CG while reducing the global communication into one per loop body instead of three, have 
been implemented in three parallel versions: CUDA, MPI, and hybrid MPIh-CUDA. Given the trends of computer 
architecture, this is a very useful option, as we demonstrate. Third, a load-balancing step is applied using heteroge¬ 
neous decomposition, which decomposes the computations taking into account the performance of each computing 
device and minimizing communication. The heterogeneous decomposition combines the performance model and the 
MeTiS/SCOTCH libraries. 


4.1. Hybrid CG and Hybrid Pipelined CG 

The main idea is to distribute the computations across heterogeneous devices, some of them being GPUs, and 
run these computations in parallel. All the processors employ a process-level parallelism using MPI and each MPI- 
process, which is connected to a GPU, accelerates the computation using CUDA. Those processors, which are not 

4 



A. AlOnazi et at. / Computers and Fluids 00 (2015) 1-12 


5 



Figure 2. A 128 x 128 x 128 3D mesh evenly decomposed using MeTis into 16 subdomains, the view is the face at the positive z-axis. 


connected to a GPU run the computation locally. We use the CUSP [22] and Thrust [23] libraries for implementing 
CUBA kernels. The computational flow on the CPU and on the GPU starts by decomposing the computational domain 
into p subdomains using OpenFOAM decomposePar function, which invokes the domain decomposition method to 
partition the mesh to a given number of subdomains. The application repeatedly invokes the linear solver, which is, 
in our case, the hybrid CG. The communication scheme of one iteration requires three global communications and 
one point-to-point communication between each pair of neighboring subdomains. The matrix is derived from the 
discretized equation, sparse, and of the dimension of the number of finite volumes. In case of a decomposed com¬ 
putational domain, the local matrix row dimension is number of cells within the subdomain. The hybrid CG solver 
supports common sparse matrix storage format that is compressed sparse row (CSR). 

We implement the pipelined conjugate gradient algorithm of Ghysels and VanRoose [10]. It can be considered 
as a communication startup-reducing algorithmic improvement of the original method. In terms of communication, 
the main kernels in CG algorithm are sparse matrix-vector multiplication, which requires point-to-point communi¬ 
cation among neighboring subdomains, and three dot products, which involve global communication and require 
participation of all subdomains. The algorithm modifies and reorders the s-step CG method, which is introduced by 
Chronopoulos and Gear in their original paper [24] and minimizes the global communication to only one collective 
communication per loop body instead of three where the global synchronization can be overlapped by the sparse 
matrix-vector multiplication kernel instead of updating a vector. Given that the runtime of a vector update is not 
enough to hide the latency of the global communication, pipelining CG will enhance the performance. Basically, it 
offers opportunity of better scalability at the price of extra computations, which are relatively cheaper. 

The hybrid pipelined CG combines the MPI pipelined CG and CUBA pipelined CG kernels. The first commu¬ 
nication inside the loop is a single reduction followed by matrix-vector multiplication, which computes the local 
computations and requires talking with neighboring subdomains to update the processor boundary. The global reduc¬ 
tion can be hidden using the asynchronous global communication. However, the point-to-point communication can 
be between two hosts, two CPUs kernel, or hybrid. These communications require data movements between host and 
device. All solvers are integrated into OpenFOAM as a dynamic library. 


4.2. Heterogeneous Decomposition 

The shape and balance of the domain decomposition play an important role in the parallel run. If the domain is 
not decomposed properly to minimize the number and size of the boundaries between pairs of processors, the parallel 
run will suffer from an unbalanced workload execution and communications overhead. The main idea behind the 
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Figure 3. A 128 x 128 x 128 3D mesh decomposed using heterogeneous decomposition into 16 subdomains to be allocated to a machine of 16 
cores and 2 GPUs, the view is the face at the positive z-axis. 


heterogeneous decomposition is to adequately partition and assign the subdomains to the processors in proportion to 
their performance. Also, the technique should take into account minimizing the communications, so as to minimize 
inter-processor communications and network congestion. As described in previous sections, OpenFOAM provides 
static load balancing during the decomposition process through some third-party library, e.g., MeTiS or SCOTCH, 
prior to running the program. MeTiS and SCOTCH both allow the user to provide a constant number per proces¬ 
sor that represents the percentage of data to be computed by this processor. However, neither library provides any 
methods or algorithms to compute this percentage for balanced workload. It is up to the user to supply these indicators. 

In a nutshell, the heterogeneous decomposition combines the performance model [4, 7] and the MeTiS/SCOTCH 
library. It estimates the relative speed of each compute device in the target platform by running the computational 
kernel on evenly distributed subdomains, and measures the execution time per processor. Then, it computes the speed 
of each processor, and builds the performance model of the application. The output of this step is a vector of numbers 
representing the computational volume per-processor accordance to its speed. The processor computational load will 
be balanced when each processor’s computational work divided by its relative speed is equal. The work shares are 
provided to MeTiS or SCOTCH to re-decompose the data. 

Let us assume the hybrid CG solver is the computational kernel to be optimized using the heterogeneous decom¬ 
position on structured grid of N cells. The input is a square sparse matrix A,, which contains «,■ x «, elements, where 
n, is number of mesh cells allocated to processor p,. The number of non-zeros in A, is n, + 2o,', where o, is number 
of off-diagonal elements in the upper-triangle of matrix A,-. It is a half-stencil size, i.e., for 3D Poisson, o, is three 
per cell. Then, let us compute the speed i, of processor p, during computing the CG kernel as number of non-zeros 
n, -H 2o, divided by the execution time T per iteration, which can be formulated in equation (5). The asymptotical time 
complexity of CG kernel per iteration is proportional to number of non-zeros [12]. 


«/ H- 20i 

Siini) 

T(ni) 

(5) 

, , Siirii) 

nitii) = 

Lp Siini) 

(6) 


(7) 

P 


ni,new ririi). 

(8) 
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Then, the relative speed r, is computed according to equation (6). After that, the new assigned number of cells 
to be allocated at processor p, is computed using equation (8). This vector of p numbers will be passed as argument 
to MeTiS/SCOTCH API to re-decompose the computational domain correspondingly. Figure 2 illustrates a cubic 
domain that is decomposed evenly into 16 subdomains. If we apply the heterogenous decomposition on a machine 
consists of 16 cores and 2 GPUs, the decomposed mesh is illustrated in Figure 3 where the large subdomains at the 
left is associated to the GPUs. 


5. EVALUATION 

Our library is implemented in CUDA, C and C++ and plugged into OpenFOAM-extl.6. The experimental plat¬ 
forms are: 

• A single node consisting of two NUMA sockets; each socket is connected to GPU device Tesla C2050. One 
socket has 6 cores Intel Xeon CPU X5670 @ 2.93GHz. 

• A single node consisting of two NUMA sockets, which are connected to two GPU devices: GeForce GTX 
TITAN. One socket has 8 cores Intel Xeon CPU E5-2650 0 @ 2.00GHz. 

We used two test cases in our experiments: 

• The lid-driven cavity flow test case contains the solution of a laminar, isothermal and incompressible flow within 
a three-dimensional cubic geometry using the icoFoam solver. The top boundary of the cube is a moving wall 
that moves in the x direction, while the rest are static walls. 

• The heat equation with Dirichlet boundary condition test case is solved over a three-dimensional cubic geometry 
using laplacianFoam. 


v 

E 

H 



laplacianFoam icoFoam 


Figure 4. Execution time of the selected OpenFOAM applications. Both internally call either OpenFOAM CG solver on CPU or Cufflink CG on 
GPU Tesla C2050 with convergence tolerance of 10“^, for a 100 x 100 x 100 3D domain. 


Figure 4 shows a comparison between OpenFOAM-CG and Cufflink-CG, which was running on a Tesla C2050. 
Cufflink is an open source library that extends the CG and BiCG solvers of OpenFOAM library to be executed on 
GPU. We measure the execution time per one step running of icoFoam and laplacianFoam as shown in the x-axis. We 
observe that Cufflink CG was able to reduce the execution time by approximately 40% when compared with Open¬ 
FOAM CG. 


7 







A. AlOnazi et at. / Computers and Fluids 00 (2015) 1-12 


8 



Metis Heterogeneous 

Decomositon Methods 


Figure 5. Heterogeneous decomposition compared against homogeneous MeTiS decomposition on a Tesla node, for a 100 x 100 x 100 3D domain. 


The heterogeneous decomposition is compared against homogeneous MeTiS decomposition on one Tesla node, 
as shown in Figure 5. This test uses the hybrid solvers, which solve a system of linear equations for a 7-point Lapla- 
cian in a 3D slab consisting of a million cells. The computational domain is decomposed into two subdomains. One 
subdomain is computed on the GPU Tesla C2050, whereas the other one on the CPU. Figure 5 shows that the het¬ 
erogeneous decomposition provides improvement of the average time per iteration around 20%. The heterogeneous 
decomposition is based on empirical measurements, equation (8), so does not depend upon an intricate model. It 
is generalizable to many problems. However, the CFD applications may consist of different computational kernels, 
which require more auto-tuning using the heterogeneous decomposition. This step is noted for future work, and the 
current implementation is a stepping stone towards balancing computational kernels across highly heterogeneous de¬ 
vices. In this work, we try to optimize the most computational intensive phase in the application, and show how this 
optimization can accelerate the whole applications. 




Processors Processors 

a) b) 

Figure 6. Hybrid solvers strong scaling measured as speedup of the execution time relative of different parallel run over the icoFoam with a 
tolerance of 10“^ calling Cufflink-CG on a Tesla node. The mesh size in a) consists of 100,000 control volumes, and in b) of 1 million control 
volumes. 
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a) 



Processors 

b) 


Figure 7. Hybrid solvers strong scaling measured as speedup of the execution time relative of different parallel run over the laplacianFoam with 
a tolerance of calling Cufflink CG on a Tesla node. The mesh size in a) consists of 100,000 control volumes, and in b) of I million control 
volumes. 


We examine the speedup on the end-to-end computation of icoFoam and laplacianFoam on a Tesla node. Figure 
6, shows the speedup of the total execution time of icoFoam using hybrid solvers over icoFoam using Cufflink CG on 
a GPU. The test case used to benchmark icoFoam is a 3D lid-driven cavity flow. It shows approximately 2x speedup 
on fixed problem size. The poor strong scalability is due to the limitation of the unoptimized stages of PISO algo¬ 
rithm. We only optimized the second stage, pressure solution, using hybrid solvers. Figure 7 shows the speedup of 
the total execution time of laplacianFoam using hybrid solvers over laplacianFoam using Cufflink GPU CG. The test 
case is a 3D heat equation. We observe a speedup about 7x in a) and 3.5x in b) better relative to Cufflink CG on a GPU. 



IM 2M 4M 

Problem Size 


Figure 8. Time per iteration comparison of pipelined CG MPI only, hybrid CG, and hybrid pipelined CG on different mesh sizes running on one 
socket, which is connected to a NVIDIA GeForce GTX TITAN GPU device, for a 7-point Laplacian in a 3D slab with a tolerance of 10“®. 


We also study the performance on an advanced GPU device, i.e., a Geforce GTX TITAN. In Figure 8, we examine 
the performance of the CG solvers on TITAN, and compare three variants: pipelined CG on CPU (PipeCG CPU), 
CG on GPU (CG GPU), and pipelined CG on GPU (PipeCG GPU). While the execution time increases with mesh 
size, the rate of increase — slope of graph — for CPU solver is steeper than GPU solvers, indicating that both GPU 
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solvers also achieves better performance. The performance of hybrid solvers is limited by device memory and PCIe 
bandwidth. The highest performance achieved is around 800 MFLOPs/s on Fermi and 2 GFLOPs/s on Titan, there¬ 
fore, the current implementation is PCIe bandwidth limited according to the roofline model Figure 1. One of the main 
bottlenecks in achieving linear scalability and better efficiency, as shown in Figures 1, is the interface between the 
CUBA and the OpenFOAM structures. We suggest that OpenFOAM core classes such as IduMatrix should be revised 
to allow for a storage of small, contiguous cache-friendly blocks of scalars, and thus potentially reduce the number of 
cache misses due to the random access patterns in the current data structures. 


Table 1. Arithmetic Intensity of Conjugate Gradient Kernels, where N\ no. of rows, NNZ: number of non-zeros, and N ^ INNZ. 


Kernel 

FLOPS 

BYTEs 

Arithmetic Intensity 

Sparse Matrix-Vector Multiplication (CSR) 

14A 

112A 

0.125 

Vector-Vector Update 

2*N 

24* A 

0.083 

Vector Inner Product 

2*N 

16 * A 

0.125 


As mentioned previously, the performance of the CG algorithm per loop depends on the performance of each 
individual kernel. The arithmetic intensity of these kernels are tabulated in Table 1. Such low arithmetic intensity 
demands high memory-bandwidth, and thus many flops are lost waiting for data. Given that the operations in the CG 
method are memory bound, the best performance is reached when the memory throughput is maximized. Combining 
the vector operations into larger kernel allows better hiding of memory latency. There are two ways to optimize the 
vector operations on GPUs; first by processing multiple vectors that will initiate more memory transactions, and thus 
the smaller the vector size the better the performance. The other way is processing one vector of large size that is, 
apparently, the only way to maximize the performance of the reduction operation; because it can not be combined 
with other operations. Therefore, the vector reduction is a bottleneck. In addition, it requires data transfer over PCIe 
between host and device, which is an order of magnitude slower than the bandwidth of the memory bus on GPU. 
These data movements also appear in the sparse matrix-vector multiplication while updating the interfaces among 
neighboring subdomains. These communications negatively and significantly affect the achievable performance. 



A) 


OpenFoam CG 
OpenFoam AMG-CG 
■■GPU CG 
I ^ GPU AMG-CG 
■■GPUPipeCG 

GPU AMG-PipeCG 
CUSP AMG-CG 


Inini 


1 



128x 128x 128 


B) 


Figure 9. Execution time and time per iteration of a steady state 7-point Laplacian with Drichlet boundary conditions in a 3D cube solved using 
dilferent none/preconditioned CG solvers with a tolerance of 10“^ on different mesh sizes running on one socket, which is connected to a NVIDIA 
GeForce GTX TITAN GPU device. 
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In Figure 9, we apply an aggregation-based algebraic multigrid preconditioner [25] to hybrid CG and hybrid Pipel- 
nied CG solvers and compare it to OpenFOAM (CPU) and CUSP (GPU). The AMG preconditioner is implemented in 
the CUSP library. It performs a V-cycle of an aggregation-based AMG with weighted Jacobi iterations with w = | as 
a smoother within multigrid [25, 22]. In OpenFOAM , an agglomeration-based algebraic multigrid preconditioner is 
applied to its CG solver that performs a V-cycle of an agglomeration-based AMG with diagonal Incomplete-Cholesky 
smoother. Our hybrid solvers are preconditioned using an aggregation-based AMG included within CUSP library. The 
increase in the number of iterations required to converge by our solvers on GPU compared to OpenFOAM AMG-CG, 
which is performed on a CPU sequentially, indicates the effects of the following factors: irregularly shaped aggregates 
generated by the parallel aggregation method based on a distance-2 maximal-independent-set within AMG [25], the 
smoother, and the multigrid hierarchies. Despite that, our solvers are faster than OpenFoam solver, as shown in Figure 
9, also GPU AMG-CG performs better than CUSP AMG-CG indicating a better implementation of CG. 


Table 2. Number of iterations to solve a steady state 7-point Laplacian with Drichlet boundary conditions in a 3D cube using different 
none/preconditioned CG solvers with a tolerance of 10“* on different mesh sizes running on one socket, which is connected to a NVIDIA GeForce 
GTX TITAN GPU device. 


Mesh Size 

N=64 

N=128 


Mesh Size 

N=64 

N=128 

OpenFOAM CG 

346 

695 


OpenFOAM AMG-CG 

7 

10 

GPUCG 

325 

636 


GPU AMG-CG 

28 

32 

GPU Pipe CG 

157 

297 


GPU AMG-PipeCG 

28 

32 

- 

- 

- 


CUSP AMG-CG 

32 

37 


6. CONCLUSIONS 

Memory bound applications, such as the OpenFOAM solvers, can take better advantage of the full hardware po¬ 
tential, which is now complex, hybrid and heterogeneous, if all resources are taken into account in a holistic approach. 
While many questions of ultimately attainable per node performance and multi-node scaling remain, the experimental 
results show that the hybrid implementation of both solvers outperforms state-of-the-art implementations of a widely 
used open source package. This work can be seen as a stepping stone towards optimized hybrid heterogenous CFD 
applications. The current limitations of the hybrid solvers and the heterogenous decomposition can be solved by more 
auto-tuning and dynamic load balancing scheduling, which adaptively balances the workload of different computa¬ 
tional kernels during the run-time, by memory-aware work stealing. 
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